Abstract


In a variety of problems originating in supervised, unsupervised, and reinforcement
learning, the loss function is defined by an expectation over a collection
of random variables, which might be part of a probabilistic model or the external
world. Estimating the gradient of this loss function, using samples, lies at
the core of gradient-based learning algorithms for these problems. We introduce
the formalism of stochastic computation graphs (directed acyclic graphs that include
both deterministic functions and conditional probability distributions) and
describe how to easily and automatically derive an unbiased estimator of the loss
function's gradient. The resulting algorithm for computing the gradient estimator
is a simple modification of the standard backpropagation algorithm. The generic
scheme we propose unifies estimators derived in a variety of prior work, along with
variance-reduction techniques therein. It could assist researchers in developing intricate
models involving a combination of stochastic and deterministic operations,
enabling, for example, attention, memory, and control actions.

1 Introduction 

The great success of neural networks is due in part to the simplicity of the backpropagation
algorithm, which allows one to efficiently compute the gradient of any loss function defined as a
composition of differentiable functions. This simplicity has allowed researchers to search in the
space of architectures for those that are both highly expressive and conducive to optimization,
yielding, for example, convolutional neural networks in vision [12] and LSTMs for sequence data [9].
However, the backpropagation algorithm is only sufficient when the loss function is a deterministic,
differentiable function of the parameter vector.

A rich class of problems arising throughout machine learning requires optimizing loss functions
that involve an expectation over random variables. Two broad categories of these problems are (1)
likelihood maximization in probabilistic models with latent variables [17, 18], and (2) policy gradients
in reinforcement learning [5, 23, 26]. Combining ideas from those two perennial topics,
recent models of attention [15] and memory [29] have used networks that involve a combination of
stochastic and deterministic operations.

In most of these problems, from probabilistic modeling to reinforcement learning, the loss functions 
and their gradients are intractable, as they involve either a sum over an exponential number of latent 
variable configurations, or high-dimensional integrals that have no analytic solution. Prior work (see
Section 6) has provided problem-specific derivations of Monte Carlo gradient estimators; however,
to our knowledge, no previous work addresses the general case.

Appendix C recalls several classic and recent techniques in variational inference [14, 10, 21] and
reinforcement learning [23, 25, 15], where the loss functions can be straightforwardly described using
the formalism of stochastic computation graphs that we introduce. For these examples, the
variance-reduced gradient estimators derived in prior work are special cases of the results in Sections 3 and 4.

The contributions of this work are as follows: 

• We introduce a formalism of stochastic computation graphs, and in this general setting, we derive 
unbiased estimators for the gradient of the expected loss. 

• We show how this estimator can be computed as the gradient of a certain differentiable function
(which we call the surrogate loss); hence, it can be computed efficiently using the backpropagation
algorithm. This observation enables a practitioner to write an efficient implementation using
automatic differentiation software.

• We describe variance reduction techniques that can be applied to the setting of stochastic computation
graphs, generalizing prior work from reinforcement learning and variational inference.

• We briefly describe how to generalize some other optimization techniques to this setting:
majorization-minimization algorithms, by constructing an expression that bounds the loss function;
and quasi-Newton / Hessian-free methods [13], by computing estimates of Hessian-vector
products.

The main practical result of this article is that to compute the gradient estimator, one just needs 
to make a simple modification to the backpropagation algorithm, where extra gradient signals are 
introduced at the stochastic nodes. Equivalently, the resulting algorithm is just the backpropagation 
algorithm, applied to the surrogate loss function, which has extra terms introduced at the stochastic 
nodes. The modified backpropagation algorithm is presented in Section 5. 


2 Preliminaries 

2.1 Gradient Estimators for a Single Random Variable 


This section will discuss computing the gradient of an expectation taken over a single random
variable; the estimators described here will be the building blocks for more complex cases with
multiple variables. Suppose that $x$ is a random variable, $f$ is a function (say, the cost), and we are
interested in computing $\frac{\partial}{\partial\theta} E_x[f(x)]$. There are a few different ways that the process for generating
$x$ could be parameterized in terms of $\theta$, which lead to different gradient estimators.


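• We might be given a parameterized probability distribution $x \sim p(\cdot;\theta)$. In this case, we can use
the score function (SF) estimator [3]:

$$\frac{\partial}{\partial\theta} E_x[f(x)] = E_x\left[f(x)\,\frac{\partial}{\partial\theta}\log p(x;\theta)\right]. \tag{1}$$

This classic equation is derived as follows:

$$\begin{aligned}
\frac{\partial}{\partial\theta} E_x[f(x)] &= \frac{\partial}{\partial\theta}\int dx\; p(x;\theta)\, f(x) = \int dx\; \frac{\partial}{\partial\theta} p(x;\theta)\, f(x) \\
&= \int dx\; p(x;\theta)\,\frac{\partial}{\partial\theta}\log p(x;\theta)\, f(x) = E_x\left[f(x)\,\frac{\partial}{\partial\theta}\log p(x;\theta)\right].
\end{aligned} \tag{2}$$

This equation is valid if and only if $p(x;\theta)$ is a continuous function of $\theta$; however, it does not
need to be a continuous function of $x$ [4].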

• $x$ may be a deterministic, differentiable function of $\theta$ and another random variable $z$, i.e., we can
write $x(z,\theta)$. Then, we can use the pathwise derivative (PD) estimator, defined as follows.

$$\frac{\partial}{\partial\theta} E_z[f(x(z,\theta))] = E_z\left[\frac{\partial}{\partial\theta} f(x(z,\theta))\right]. \tag{3}$$

This equation, which merely swaps the derivative and expectation, is valid if and only if $f(x(z,\theta))$
is a continuous function of $\theta$ for all $z$ [4].¹ That is not true if, for example, $f$ is a step function.

¹ Note that for the pathwise derivative estimator, $f(x(z,\theta))$ merely needs to be a continuous function of
$\theta$; it is sufficient that this function is almost-everywhere differentiable. A similar statement can be made
about $p(x;\theta)$ and the score function estimator. See Glasserman [4] for a detailed discussion of the technical
requirements for these gradient estimators to be valid.
• Finally, $\theta$ might appear both in the probability distribution and inside the expectation, e.g., in
$E_{z\sim p(\cdot;\theta)}[f(x(z,\theta))]$. Then the gradient estimator has two terms:

$$\frac{\partial}{\partial\theta} E_{z\sim p(\cdot;\theta)}[f(x(z,\theta))] = E_{z\sim p(\cdot;\theta)}\left[\frac{\partial}{\partial\theta} f(x(z,\theta)) + \left(\frac{\partial}{\partial\theta}\log p(z;\theta)\right) f(x(z,\theta))\right]. \tag{4}$$

This formula can be derived by writing the expectation as an integral and differentiating, as in
Equation (2).
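The first two estimators can be compared numerically on a toy problem. The sketch below is only an illustration under assumed choices that do not come from the text: $x \sim N(\theta, 1)$, cost $f(x) = x^2$, and the hypothetical helper names `sf_sample`, `pd_sample`, and `mean_estimate`. For this choice the exact gradient is $2\theta$.

```python
import random

def sf_sample(theta, rng):
    """One score-function sample of d/dtheta E[f(x)], x ~ N(theta, 1), f(x) = x**2."""
    x = rng.gauss(theta, 1.0)
    score = x - theta              # d/dtheta log N(x; theta, 1)
    return (x ** 2) * score

def pd_sample(theta, rng):
    """One pathwise-derivative sample, writing x = theta + z with z ~ N(0, 1)."""
    z = rng.gauss(0.0, 1.0)
    x = theta + z
    return 2.0 * x                 # f'(x) * dx/dtheta, where dx/dtheta = 1

def mean_estimate(sampler, theta, n, seed):
    """Average n independent samples of the chosen estimator."""
    rng = random.Random(seed)
    return sum(sampler(theta, rng) for _ in range(n)) / n

theta = 1.5
sf = mean_estimate(sf_sample, theta, 50_000, seed=0)
pd = mean_estimate(pd_sample, theta, 50_000, seed=1)
# Both averages should land near the exact gradient 2 * theta = 3.0,
# with the PD average typically much closer for the same sample count.
```

Both samplers target the same quantity; they differ only in whether $\theta$ enters through the distribution (SF) or through the sample path (PD).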


In some cases, it is possible to reparameterize a probabilistic model, moving $\theta$ from the distribution
to inside the expectation or vice versa. See [3] for a general discussion, and see [10, 21] for a recent
application of this idea to variational inference.

The SF and PD estimators are applicable in different scenarios and have different properties. 


1. SF is valid under more permissive mathematical conditions than PD. SF can be used if $f$ is
discontinuous, or if $x$ is a discrete random variable.

2. SF only requires sample values $f(x)$, whereas PD requires the derivatives $f'(x)$. In the context
of control (reinforcement learning), SF can be used to obtain unbiased policy gradient estimators
in the "model-free" setting, where we have no model of the dynamics and only have access to
sample trajectories.

3. SF tends to have higher variance than PD, when both estimators are applicable (see for instance
[3, 21]). The variance of SF increases (often linearly) with the dimensionality of the sampled
variables. Hence, PD is usually preferable when $x$ is high-dimensional. On the other hand, PD
has high variance if the function $f$ is rough, which occurs in many time-series problems due to
an "exploding gradient problem" / "butterfly effect".

4. PD allows for a deterministic limit, SF does not. This idea is exploited by the deterministic policy
gradient algorithm [22].
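Property 1 can be illustrated with a discrete variable, where no derivative $f'(x)$ exists. The following sketch assumes a Bernoulli variable with success probability $\mathrm{sigmoid}(\theta)$ and an arbitrary two-valued cost; these choices and all function names are hypothetical. Because $x$ takes only two values, the expectation of the SF estimator is computed exactly and compared against a finite-difference derivative of the expected cost.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def f(x):
    # A cost defined only on the two outcomes; there is no f'(x) to differentiate.
    return 1.0 if x == 1 else 3.0

def expected_cost(theta):
    """E[f(x)] for x ~ Bernoulli(sigmoid(theta))."""
    p = sigmoid(theta)
    return p * f(1) + (1.0 - p) * f(0)

def expected_sf_estimator(theta):
    """Exact expectation of f(x) * d/dtheta log p(x; theta) over x in {0, 1}."""
    p = sigmoid(theta)
    total = 0.0
    for x, prob in ((1, p), (0, 1.0 - p)):
        score = x - p              # d/dtheta log Bernoulli(x; sigmoid(theta))
        total += prob * f(x) * score
    return total

theta = 0.3
eps = 1e-6
finite_diff = (expected_cost(theta + eps) - expected_cost(theta - eps)) / (2 * eps)
sf_gradient = expected_sf_estimator(theta)
# The two numbers should agree up to finite-difference error.
```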


Nomenclature. The methods of estimating gradients of expectations have been independently proposed
in several different fields, which use differing terminology. What we call the score function
estimator (via [3]) is alternatively called the likelihood ratio estimator [5] and REINFORCE [26].
We chose this term because the score function is a well-known object in statistics. What we call
the pathwise derivative estimator (from the mathematical finance literature [4] and reinforcement
learning [16]) is alternatively called infinitesimal perturbation analysis and stochastic
backpropagation [21]. We chose this term because pathwise derivative is evocative of propagating a derivative
through a sample path.


2.2 Stochastic Computation Graphs 

The results of this article will apply to stochastic computation graphs, which are defined as follows: 


Definition 1 (Stochastic Computation Graph). A directed, acyclic graph, with three types of 
nodes: 

1. Input nodes, which are set externally, including the parameters we differentiate with 
respect to. 

2. Deterministic nodes, which are functions of their parents. 

3. Stochastic nodes, which are distributed conditionally on their parents. 

Each parent $v$ of a non-input node $w$ is connected to it by a directed edge $(v, w)$.
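Definition 1 maps naturally onto a small data structure. The sketch below is one possible encoding, not something prescribed by the text: the `Node` class, the kind strings, and `topological_order` are hypothetical names. It checks the two structural requirements (input nodes have no parents, the graph is acyclic) and returns an ordering in which parents precede children.

```python
from dataclasses import dataclass, field

KINDS = ("input", "deterministic", "stochastic")

@dataclass
class Node:
    name: str
    kind: str                                     # one of KINDS
    parents: list = field(default_factory=list)   # names of parent nodes

def topological_order(nodes):
    """Return node names with parents before children; raise ValueError if invalid."""
    by_name = {n.name: n for n in nodes}
    for n in nodes:
        if n.kind not in KINDS:
            raise ValueError(f"unknown kind for {n.name}: {n.kind}")
        if n.kind == "input" and n.parents:
            raise ValueError(f"input node {n.name} must not have parents")
        for p in n.parents:
            if p not in by_name:
                raise ValueError(f"unknown parent {p} of {n.name}")
    order, state = [], {}          # state: 1 = being visited, 2 = finished

    def visit(name):
        if state.get(name) == 2:
            return
        if state.get(name) == 1:
            raise ValueError("graph contains a cycle")
        state[name] = 1
        for p in by_name[name].parents:
            visit(p)
        state[name] = 2
        order.append(name)

    for n in nodes:
        visit(n.name)
    return order

# A chain theta -> x -> y -> f in which only x is stochastic.
graph = [
    Node("f", "deterministic", ["y"]),
    Node("y", "deterministic", ["x"]),
    Node("x", "stochastic", ["theta"]),
    Node("theta", "input"),
]
order = topological_order(graph)
```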

In the subsequent diagrams of this article, we will use circles to denote stochastic nodes and squares
to denote deterministic nodes. The structure of the graph fully specifies what
estimator we will use: SF, PD, or a combination thereof. This graphical notation is shown below,
along with the single-variable estimators from Section 2.1.

[Figure: two single-variable stochastic computation graphs; the first gives the SF estimator, the second gives the PD estimator.]


2.3 Simple Examples 

Several simple examples that illustrate the stochastic computation graph formalism are shown below. 
The gradient estimators can be described by writing the expectations as integrals and differentiating, 
as with the simpler estimators from Section 2.1. However, they are also implied by the general 
results that we will present in Section 3. 



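     Stochastic Computation Graph    Objective             Gradient Estimator

(1)  θ → [x] → (y) → [f]             $E_y[f(y)]$           $\frac{\partial x}{\partial\theta}\frac{\partial}{\partial x}\log p(y \mid x)\, f(y)$

(2)  θ → (x) → [y] → [f]             $E_x[f(y(x))]$        $\frac{\partial}{\partial\theta}\log p(x \mid \theta)\, f(y(x))$

(3)  θ → (x) → (y) → [f]             $E_{x,y}[f(y)]$       $\frac{\partial}{\partial\theta}\log p(x \mid \theta)\, f(y)$

Figure 1: Simple stochastic computation graphs. In this text rendering, (·) marks a stochastic node and [·] a deterministic node.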


These simple examples illustrate several important motifs, where stochastic and deterministic nodes
are arranged in series or in parallel. For example, note that in (2) the derivative of $y$ does not appear
in the estimator, since the path from $\theta$ to $f$ is "blocked" by $x$. Similarly, in (3), $p(y \mid x)$ does not
appear (this type of behavior is particularly useful if we only have access to a simulator of a system,
but not access to the actual likelihood function). On the other hand, (4) has a direct path from $\theta$ to
$f$, which contributes a term to the gradient estimator. (5) resembles a parameterized Markov reward
process, and it illustrates that we'll obtain score function terms of the form grad log-probability ×
future costs.

The examples above all have one input $\theta$, but the formalism accommodates models with multiple
inputs, for example a stochastic neural network with multiple layers of weights and biases, which
may influence different subsets of the stochastic and cost nodes. See Appendix C for nontrivial
examples with stochastic nodes and multiple inputs. The accompanying figure shows a deterministic
computation graph representing classification loss for a two-layer neural network, which has four
parameters $(W_1, b_1, W_2, b_2)$ (weights and biases). Of course, this deterministic computation graph
is a special type of stochastic computation graph.
3 Main Results on Stochastic Computation Graphs

3.1 Gradient Estimators 

This section will consider a general stochastic computation graph, in which a certain set of nodes
are designated as costs, and we would like to compute the gradient of the sum of costs with respect
to some input node $\theta$.

In brief, the main results of this section are as follows: 

1. We derive a gradient estimator for an expected sum of costs in a stochastic computation graph.
This estimator contains two parts: (1) a score function part, which is a sum of terms of the form
(grad log-prob of variable) × (sum of costs influenced by variable); and (2) a pathwise derivative
term, that propagates the dependence through differentiable functions.

2. This gradient estimator can be computed efficiently by differentiating an appropriate “surrogate” 
objective function. 

Let $\Theta$ denote the set of input nodes, $\mathcal{D}$ the set of deterministic nodes, and $\mathcal{S}$ the set of stochastic
nodes. Further, we will designate a set of cost nodes $\mathcal{C}$, which are scalar-valued and deterministic.
(Note that there is no loss of generality in assuming that the costs are deterministic: if a cost is
stochastic, we can simply append a deterministic node that applies the identity function to it.) We
will use $\theta$ to denote an input node ($\theta \in \Theta$) that we differentiate with respect to. In the context of
machine learning, we will usually be most concerned with differentiating with respect to a parameter
vector (or tensor); however, the theory we present does not make any assumptions about what $\theta$
represents.

For the results that follow, we need to define the notion of "influence", for which we will introduce
two relations $\prec$ and $\prec_D$. The relation $v \prec w$ ("$v$ influences $w$") means that there exists a sequence
of nodes $a_1, a_2, \dots, a_K$, with $K \ge 0$, such that $(v, a_1), (a_1, a_2), \dots, (a_{K-1}, a_K), (a_K, w)$ are
edges in the graph. The relation $v \prec_D w$ ("$v$ deterministically influences $w$") is defined similarly,
except that now we require that each $a_k$ is a deterministic node. For example, in Figure 1, diagram (5)
above, $\theta$ influences $\{x_1, x_2, f_1, f_2\}$, but it only deterministically influences $\{x_1, x_2\}$.
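The two relations can be computed with a graph search. The sketch below is a hedged illustration: the function name `influences`, the `deterministic_only` flag, and the dictionary encoding of the graph are assumptions made for this example. It treats a stochastic intermediate node as blocking a path when checking deterministic influence, while the endpoints themselves may be of any type.

```python
def influences(edges, kinds, v, w, deterministic_only=False):
    """Return True if v influences w.

    edges: dict mapping a node name to the list of its children.
    kinds: dict mapping a node name to "input", "deterministic", or "stochastic".
    With deterministic_only=True, every intermediate node on the path must be
    deterministic (the relation "v deterministically influences w").
    """
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        for child in edges.get(node, []):
            if child == w:
                return True
            if child in seen:
                continue
            seen.add(child)
            if deterministic_only and kinds[child] != "deterministic":
                continue           # a stochastic intermediate node blocks the path
            stack.append(child)
    return False

# Chain theta -> x -> y -> f, with x stochastic and y, f deterministic.
edges = {"theta": ["x"], "x": ["y"], "y": ["f"]}
kinds = {"theta": "input", "x": "stochastic",
         "y": "deterministic", "f": "deterministic"}

theta_influences_f = influences(edges, kinds, "theta", "f")
theta_det_influences_f = influences(edges, kinds, "theta", "f", deterministic_only=True)
```

In this chain, `theta` influences `f`, but not deterministically, because the only path passes through the stochastic node `x`.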

Next, we will establish a condition that is sufficient
for the existence of the gradient. Namely, we will stipulate that every edge $(v, w)$ with $w$ lying in
the "influenced" set of $\theta$ corresponds to a differentiable dependency: if $w$ is deterministic, then the
Jacobian $\frac{\partial w}{\partial v}$ must exist; if $w$ is stochastic, then the probability mass function $p(w \mid v, \dots)$ must be
differentiable with respect to $v$.

More formally: 


Condition 1 (Differentiability Requirements). Given input node $\theta \in \Theta$, for all edges $(v, w)$
which satisfy $\theta \prec_D v$ and $\theta \prec_D w$, the following condition holds: if $w$ is deterministic, the
Jacobian $\frac{\partial w}{\partial v}$ exists, and if $w$ is stochastic, then the derivative of the probability mass function
$\frac{\partial}{\partial v} p(w \mid \text{PARENTS}_w)$ exists.

Note that Condition 1 does not require that all the functions in the graph are differentiable. If the path from
an input $\theta$ to deterministic node $v$ is blocked by stochastic nodes, then $v$ may be a nondifferentiable
function of its parents. If a path from input $\theta$ to stochastic node $v$ is blocked by other stochastic
nodes, the likelihood of $v$ given its parents need not be differentiable; in fact, it does not need to be
known.²

² This fact is particularly important for reinforcement learning, allowing us to compute policy gradient
estimates despite having a discontinuous dynamics function or reward function.


Notation Glossary

$\Theta$: Input nodes
$\mathcal{D}$: Deterministic nodes
$\mathcal{S}$: Stochastic nodes
$\mathcal{C}$: Cost nodes
$v \prec w$: $v$ influences $w$
$v \prec_D w$: $v$ deterministically influences $w$
$\text{DEPS}_v$: "dependencies", $\{w \in \Theta \cup \mathcal{S} \mid w \prec_D v\}$
$\hat{Q}_v$: sum of cost nodes influenced by $v$
$\hat{v}$: denotes the sampled value of the node $v$
We need a few more definitions to state the main theorems. Let $\text{DEPS}_v := \{w \in \Theta \cup \mathcal{S} \mid w \prec_D v\}$,
the "dependencies" of node $v$, i.e., the set of nodes that deterministically influence it. Note the
following:

• If $v \in \mathcal{S}$, the probability mass function of $v$ is a function of $\text{DEPS}_v$, i.e., we can write $p(v \mid \text{DEPS}_v)$.

• If $v \in \mathcal{D}$, $v$ is a deterministic function of $\text{DEPS}_v$, so we can write $v(\text{DEPS}_v)$.

Let $\hat{Q}_v := \sum_{c \in \mathcal{C},\, c \succ v} \hat{c}$, i.e., the sum of costs downstream of node $v$. These costs will be treated as
constant, fixed to the values obtained during sampling. In general, we will use the hat symbol $\hat{v}$ to
denote a sample value of variable $v$, which will be treated as constant in the gradient formulae.

Now we can write down a general expression for the gradient of the expected sum of costs in a 
stochastic computation graph: 


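Theorem 1. Suppose that $\theta \in \Theta$ satisfies Condition 1. Then the following two equivalent equations hold:

$$\frac{\partial}{\partial\theta} E\left[\sum_{c \in \mathcal{C}} c\right] = E\left[\sum_{\substack{w \in \mathcal{S} \\ \theta \prec_D w}} \left(\frac{\partial}{\partial\theta}\log p(w \mid \text{DEPS}_w)\right)\hat{Q}_w + \sum_{\substack{c \in \mathcal{C} \\ \theta \prec_D c}} \frac{\partial}{\partial\theta} c(\text{DEPS}_c)\right] \tag{5}$$

$$= E\left[\sum_{c \in \mathcal{C}} \hat{c} \sum_{\substack{w \prec c \\ \theta \prec_D w}} \frac{\partial}{\partial\theta}\log p(w \mid \text{DEPS}_w) + \sum_{\substack{c \in \mathcal{C} \\ \theta \prec_D c}} \frac{\partial}{\partial\theta} c(\text{DEPS}_c)\right] \tag{6}$$

Proof: See Appendix A.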


The estimator expressions above have two terms. The first term is due to the influence of $\theta$ on probability
distributions. The second term is due to the influence of $\theta$ on the cost variables through a chain
of differentiable functions. The first term in Equation (5) involves a sum of gradients times "downstream"
costs, whereas the first term in Equation (6) has a sum of costs times "upstream" gradients.
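As a worked instance of Theorem 1, consider the chain $\theta \to x \to y \to f$ of diagram (2) in Figure 1, with $x$ stochastic and $y$, $f$ deterministic. The bookkeeping below only spells out the sets that enter Equation (5) for this graph; it introduces no new result.

```latex
\begin{aligned}
\mathcal{S} &= \{x\}, \qquad \mathcal{C} = \{f\}, \qquad \text{DEPS}_x = \{\theta\}, \qquad \text{DEPS}_f = \{x\}, \\
\theta &\prec_D x, \qquad \theta \not\prec_D f \quad \text{(the path to } f \text{ is blocked by } x\text{)}, \\
\hat{Q}_x &= \hat{f} = f(y(\hat{x})), \\
\frac{\partial}{\partial\theta} E[f]
  &= E\left[\left(\frac{\partial}{\partial\theta}\log p(x \mid \theta)\right)\hat{Q}_x\right]
   = E\left[\frac{\partial}{\partial\theta}\log p(x \mid \theta)\, f(y(x))\right].
\end{aligned}
```

The second sum in Equation (5) is empty because no cost is deterministically influenced by $\theta$, so only the score function term survives, in agreement with the estimator listed for diagram (2).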


3.2 Surrogate Loss Functions 

The next corollary lets us write down a “surrogate” objective L, 
which is a function of the inputs that we can differentiate to obtain 
an unbiased gradient estimator. 

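Corollary 1. Let $L(\Theta, \mathcal{S}) := \sum_{w} \log p(w \mid \text{DEPS}_w)\,\hat{Q}_w + \sum_{c \in \mathcal{C}} c(\text{DEPS}_c)$.
Then differentiation of $L$ gives us an unbiased gradient
estimate: $\frac{\partial}{\partial\theta} E\left[\sum_{c \in \mathcal{C}} c\right] = E\left[\frac{\partial}{\partial\theta} L(\Theta, \mathcal{S})\right]$.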

One practical consequence of this result is that we can apply a standard
automatic differentiation procedure to $L$ to obtain an unbiased
gradient estimator. In other words, we convert the stochastic computation
graph into a deterministic computation graph, to which we
can apply the backpropagation algorithm.

There are several alternative ways to define the surrogate objective
function that give the same gradient as $L$ from Corollary 1. We
could also write $L(\Theta, \mathcal{S}) := \sum_{w} \frac{p(\hat{w} \mid \text{DEPS}_w)}{\hat{P}_w}\,\hat{Q}_w + \sum_{c \in \mathcal{C}} c(\text{DEPS}_c)$,
where $\hat{P}_w$ is the probability $p(w \mid \text{DEPS}_w)$ obtained during sampling,
which is viewed as a constant.
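The surrogate construction can be checked numerically on a one-node graph. The sketch below assumes $x \sim N(\theta, 1)$ with cost $f(x) = x^2$ and, to stay dependency-free, differentiates the surrogate with central finite differences instead of automatic differentiation; all names are hypothetical. The derivative of the surrogate at the sampled values should match the score function sample $f(\hat{x})\,\frac{\partial}{\partial\theta}\log p(\hat{x};\theta)$.

```python
import random

def log_normal_pdf(x, mean):
    """log N(x; mean, 1) up to an additive constant that does not depend on mean."""
    return -0.5 * (x - mean) ** 2

def surrogate(theta, x_hat, q_hat):
    """L(theta) = log p(x_hat; theta) * Q_hat, with x_hat and Q_hat held fixed."""
    return log_normal_pdf(x_hat, theta) * q_hat

def numerical_grad(fn, theta, eps=1e-5):
    """Central finite difference, standing in for automatic differentiation."""
    return (fn(theta + eps) - fn(theta - eps)) / (2 * eps)

rng = random.Random(0)
theta = 0.7
x_hat = rng.gauss(theta, 1.0)      # sample the stochastic node
q_hat = x_hat ** 2                 # downstream cost f(x) = x**2, treated as a constant

grad_of_surrogate = numerical_grad(lambda t: surrogate(t, x_hat, q_hat), theta)
score_function_sample = (x_hat - theta) * q_hat   # f(x) * d/dtheta log p(x; theta)
```

The key design point is that `x_hat` and `q_hat` are frozen before differentiating, which mirrors the role of the hat symbols in the definition of $L$.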


[Figure 2: Deterministic computation graphs obtained as surrogate loss functions of stochastic computation graphs from Figure 1.]


The surrogate objective from Corollary 1 is actually an upper bound
on the true objective in the case that (1) all costs $c \in \mathcal{C}$ are negative, and
(2) the costs are not deterministically influenced by the parameters $\Theta$. This construction
allows for majorization-minimization algorithms (similar to EM) to be applied to general stochastic
computation graphs. See Appendix B for details.
3.3 Higher-Order Derivatives


The gradient estimator for a stochastic computation graph is itself a stochastic computation graph. 
Hence, it is possible to compute the gradient yet again (for each component of the gradient vector), 
and get an estimator of the Hessian. For most problems of interest, it is not efficient to compute 
this dense Hessian. On the other hand, one can also differentiate the gradient-vector product to get 
a Hessian-vector product—this computation is usually not much more expensive than the gradient 
computation itself. The Hessian-vector product can be used to implement a quasi-Newton algorithm
via the conjugate gradient algorithm [28]. A variant of this technique, called Hessian-free
optimization [13], has been used to train large neural networks.


4 Variance Reduction 

Consider estimating $\frac{\partial}{\partial\theta} E_{x \sim p(\cdot;\theta)}[f(x)]$. Clearly this expectation is unaffected by subtracting a
constant $b$ from the integrand, which gives $\frac{\partial}{\partial\theta} E_{x \sim p(\cdot;\theta)}[f(x) - b]$. Taking the score function estimator,
we get $\frac{\partial}{\partial\theta} E_{x \sim p(\cdot;\theta)}[f(x)] = E_{x \sim p(\cdot;\theta)}\left[\frac{\partial}{\partial\theta}\log p(x;\theta)\,(f(x) - b)\right]$. Taking $b = E_x[f(x)]$ generally
leads to substantial variance reduction; $b$ is often called a baseline³ (see [6] for a more thorough
discussion of baselines and their variance reduction properties).
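The effect of a baseline can be seen on a toy problem. The sketch below assumes $x \sim N(\theta, 1)$ and a cost $f(x) = x^2 + 10$ whose mean is known in closed form; these choices and the helper names are illustrative only. It draws the same samples twice, once with $b = 0$ and once with $b = E_x[f(x)]$, and compares the sample variances of the two estimators.

```python
import random

def f(x):
    return x ** 2 + 10.0

def sf_samples(theta, baseline, n, seed):
    """Score-function samples of d/dtheta E[f(x)] for x ~ N(theta, 1)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        x = rng.gauss(theta, 1.0)
        out.append((x - theta) * (f(x) - baseline))
    return out

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

theta = 1.0
n = 20_000
plain = sf_samples(theta, baseline=0.0, n=n, seed=0)
# E[f(x)] = theta**2 + 1 + 10 is available in closed form for this toy cost.
with_baseline = sf_samples(theta, baseline=theta ** 2 + 11.0, n=n, seed=0)
# Both sample means estimate the same gradient, 2 * theta; the baseline only
# changes the spread of the individual samples.
```

In practice $E_x[f(x)]$ is rarely known exactly, so the baseline would itself have to be estimated; the closed form is used here only to keep the example short.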

We can make a general statement for the case of stochastic computation graphs: we can
add a baseline to every stochastic node, which may depend on all of the nodes it doesn't influence. Let

$\text{NONINFLUENCED}(v) := \{w \mid v \nprec w\}$.


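Theorem 2.

$$\frac{\partial}{\partial\theta} E\left[\sum_{c \in \mathcal{C}} c\right] = E\left[\sum_{\substack{v \in \mathcal{S} \\ v \succ \theta}} \left(\frac{\partial}{\partial\theta}\log p(v \mid \text{PARENTS}_v)\right)\left(\hat{Q}_v - b(\text{NONINFLUENCED}(v))\right) + \sum_{\substack{c \in \mathcal{C} \\ c \succ \theta}} \frac{\partial}{\partial\theta} c\right]$$

Proof: See Appendix A.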


5 Algorithms 

As shown in Section 3, the gradient estimator can be obtained by differentiating a surrogate objective 
function L. Hence, this derivative can be computed by performing the backpropagation algorithm 
on L. That is likely to be the most practical and efficient method, and can be facilitated by automatic 
differentiation software. 


Algorithm 1 shows explicitly how to compute the gradient estimator in a backwards pass through
the stochastic computation graph. The algorithm will recursively compute
$g_v := \frac{\partial}{\partial v} E\left[\sum_{c \in \mathcal{C},\, v \prec c} c\right]$ at every deterministic and input node $v$.

6 Related Work 


As discussed in Section 2, the score function and pathwise derivative estimators have been used in a
variety of different fields, under different names. See [3] for a review of gradient estimation, mostly
from the simulation optimization literature. Glasserman's textbook [4] provides an extensive treatment
of various gradient estimators and Monte Carlo estimators in general. Griewank and Walther's
textbook [8] is a comprehensive reference on computation graphs and automatic differentiation (of
deterministic programs). The notation and nomenclature we use is inspired by Bayes nets and
influence diagrams [19]. (In fact, a stochastic computation graph is a type of Bayes network, where
the deterministic nodes correspond to degenerate probability distributions.)


The topic of gradient estimation has drawn significant recent interest in machine learning. Gradients
for networks with stochastic units were investigated in Bengio et al. [2], though they are concerned
with differentiating through individual units and layers, not with how to deal with arbitrarily structured
models and loss functions. Kingma and Welling [11] consider a similar framework, although only
with continuous latent variables, and point out that reparameterization can be used to convert
hierarchical Bayesian models into neural networks, which can then be trained by backpropagation.

³ The optimal baseline for scalar $\theta$ is in fact the weighted expectation
$\frac{E_x[f(x)\,s(x)^2]}{E_x[s(x)^2]}$, where $s(x) = \frac{\partial}{\partial\theta}\log p(x;\theta)$.

Algorithm 1 Compute Gradient Estimator for Stochastic Computation Graph

  for v ∈ Graph do                                      ▷ Initialization at output nodes
      g_v = 1_{dim v} if v ∈ C, and 0_{dim v} otherwise
  end for
  Compute Q̂_w for all nodes w ∈ Graph
  for v in ReverseTopologicalSort(NonInputs) do         ▷ Reverse traversal
      for w ∈ PARENTS_v do
          if not IsStochastic(w) then
              if IsStochastic(v) then
                  g_w += (∂/∂w log p(v | PARENTS_v)) Q̂_v
              else
                  g_w += (∂v/∂w)^T g_v
              end if
          end if
      end for
  end for
  return [g_θ]_{θ ∈ Θ}

The score function method is used to perform variational inference in general models (in the context 
of probabilistic programming) in Wingate and Weber [27], and similarly in Ranganath et al. [20]; 
both papers mostly focus on mean-field approximations without amortized inference. It is used to 
train generative models using neural networks with discrete stochastic units in Mnih and Gregor [14]
and Gregor et al. [7]; both amortize inference by using an inference network.

Generative models with continuous-valued latent variables are trained (again using an
inference network) with the reparameterization method by Rezende, Mohamed, and Wierstra [21] and
by Kingma and Welling [10]. Rezende et al. also provide a detailed discussion of reparameterization,
including a discussion comparing the variance of the SF and PD estimators.

Bengio, Leonard, and Courville [2] have recently written a paper about gradient estimation in neural 
networks with stochastic units or non-differentiable activation functions—including Monte Carlo 
estimators and heuristic approximations. The notion that policy gradients can be computed in multiple
ways was pointed out in early work on policy gradients by Williams [26]. However, all of this
prior work deals with specific structures of the stochastic computation graph and does not address 
the general case. 

7 Conclusion 

We have developed a framework for describing a computation with stochastic and deterministic 
operations, called a stochastic computation graph. Given a stochastic computation graph, we can 
automatically obtain a gradient estimator, given that the graph satisfies the appropriate conditions 
on differentiability of the functions at its nodes. The gradient can be computed efficiently in a 
backwards traversal through the graph: one approach is to apply the standard backpropagation
algorithm to one of the surrogate loss functions from Section 3; another approach (which is roughly
equivalent) is to apply a modified backpropagation procedure shown in Algorithm 1. The results we 
have presented are sufficiently general to automatically reproduce a variety of gradient estimators 
that have been derived in prior work in reinforcement learning and probabilistic modeling, as we 
show in Appendix C. We hope that this work will facilitate further development of interesting and 
expressive models. 
A Proofs

Theorem 1 


We will consider the case that all of the random variables are continuous-valued, thus the expectations
can be written as integrals. For discrete random variables, the integrals should be changed to
sums.


Recall that we seek to compute $\frac{\partial}{\partial\theta} E\left[\sum_{c \in \mathcal{C}} c\right]$. We will differentiate the expectation of a single cost
term; summing over these terms yields Equation (6).


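$$\frac{\partial}{\partial\theta} E_{\substack{v \in \mathcal{S} \\ v \prec c}}[c] = \frac{\partial}{\partial\theta} \int \prod_{\substack{v \in \mathcal{S} \\ v \prec c}} p(v \mid \text{DEPS}_v)\,dv\; c(\text{DEPS}_c) \tag{7}$$

$$= \int \frac{\partial}{\partial\theta}\left[\prod_{\substack{v \in \mathcal{S} \\ v \prec c}} p(v \mid \text{DEPS}_v)\,dv\; c(\text{DEPS}_c)\right] \tag{8}$$

$$= \int \prod_{\substack{v \in \mathcal{S} \\ v \prec c}} p(v \mid \text{DEPS}_v)\,dv \left[\sum_{\substack{w \in \mathcal{S} \\ w \prec c}} \frac{\frac{\partial}{\partial\theta} p(w \mid \text{DEPS}_w)}{p(w \mid \text{DEPS}_w)}\, c(\text{DEPS}_c) + \frac{\partial}{\partial\theta} c(\text{DEPS}_c)\right] \tag{9}$$

$$= \int \prod_{\substack{v \in \mathcal{S} \\ v \prec c}} p(v \mid \text{DEPS}_v)\,dv \left[\sum_{\substack{w \in \mathcal{S} \\ w \prec c}} \left(\frac{\partial}{\partial\theta}\log p(w \mid \text{DEPS}_w)\right) c(\text{DEPS}_c) + \frac{\partial}{\partial\theta} c(\text{DEPS}_c)\right] \tag{10}$$

$$= E_{\substack{v \in \mathcal{S} \\ v \prec c}}\left[\sum_{\substack{w \in \mathcal{S} \\ w \prec c}} \frac{\partial}{\partial\theta}\log p(w \mid \text{DEPS}_w)\,\hat{c} + \frac{\partial}{\partial\theta} c(\text{DEPS}_c)\right]. \tag{11}$$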

Equation (9) requires that the integrand is differentiable, which is satisfied if all of the PDFs and
$c(\text{DEPS}_c)$ are differentiable. Equation (6) follows by summing over all costs $c \in \mathcal{C}$. Equation (5)
follows from rearrangement of the terms in this equation.


Theorem 2 


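It suffices to show that for a particular node $v \in \mathcal{S}$, the following expectation (taken over all
variables) vanishes:

$$E\left[\left(\frac{\partial}{\partial\theta}\log p(v \mid \text{PARENTS}_v)\right) b(\text{NONINFLUENCED}(v))\right]. \tag{12}$$

Analogously to $\text{NONINFLUENCED}(v)$, define $\text{INFLUENCED}(v) := \{w \mid w \succ v\}$. Note that the
nodes can be ordered so that $\text{NONINFLUENCED}(v)$ all come before $v$ in the ordering. Thus, we
can write

$$E_{\text{NONINFLUENCED}(v)}\left[E_{\text{INFLUENCED}(v)}\left[\left(\frac{\partial}{\partial\theta}\log p(v \mid \text{PARENTS}_v)\right) b(\text{NONINFLUENCED}(v))\right]\right] \tag{13}$$

$$= E_{\text{NONINFLUENCED}(v)}\left[E_{\text{INFLUENCED}(v)}\left[\frac{\partial}{\partial\theta}\log p(v \mid \text{PARENTS}_v)\right] b(\text{NONINFLUENCED}(v))\right] \tag{14}$$

$$= E_{\text{NONINFLUENCED}(v)}\left[0 \cdot b(\text{NONINFLUENCED}(v))\right] \tag{15}$$

$$= 0, \tag{16}$$

where we used $E_{\text{INFLUENCED}(v)}\left[\frac{\partial}{\partial\theta}\log p(v \mid \text{PARENTS}_v)\right] = E_v\left[\frac{\partial}{\partial\theta}\log p(v \mid \text{PARENTS}_v)\right] = 0$.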
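The final identity, that the score has zero mean, follows from the same manipulation as in Equation (2). For completeness, a short derivation for a continuous-valued $v$ (assuming the derivative and integral can be exchanged):

```latex
\begin{aligned}
E_v\left[\frac{\partial}{\partial\theta}\log p(v \mid \text{PARENTS}_v)\right]
  &= \int p(v \mid \text{PARENTS}_v)\,\frac{\partial}{\partial\theta}\log p(v \mid \text{PARENTS}_v)\,dv \\
  &= \int \frac{\partial}{\partial\theta}\, p(v \mid \text{PARENTS}_v)\,dv
   = \frac{\partial}{\partial\theta}\int p(v \mid \text{PARENTS}_v)\,dv
   = \frac{\partial}{\partial\theta}\, 1 = 0 .
\end{aligned}
```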


B Surrogate as an Upper Bound, and MM Algorithms 

$L$ has additional significance besides allowing us to estimate the gradient of the expected sum of
costs. Under certain conditions, $L$ is an upper bound on the true objective (plus a constant).
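We shall make two restrictions on the stochastic computation graph: (1) all costs $c \in \mathcal{C}$
are negative, and (2) the costs are not deterministically influenced by the parameters $\Theta$. First, let
us use importance sampling to write down the expectation of a given cost node, when the sampling
distribution is different from the distribution we are evaluating: for parameter $\theta \in \Theta$, $\theta = \theta_{\text{old}}$ is
used for sampling, but we are evaluating at $\theta = \theta_{\text{new}}$.

$$E_{v \prec c \,\mid\, \theta_{\text{new}}}[\hat{c}] = E_{v \prec c \,\mid\, \theta_{\text{old}}}\left[\hat{c} \prod_{\substack{v \prec c \\ \theta \prec_D v}} \frac{P_v(v \mid \text{DEPS}_v \setminus \theta,\, \theta_{\text{new}})}{P_v(v \mid \text{DEPS}_v \setminus \theta,\, \theta_{\text{old}})}\right] \tag{17}$$

$$\le E_{v \prec c \,\mid\, \theta_{\text{old}}}\left[\hat{c}\left(\log\left(\prod_{\substack{v \prec c \\ \theta \prec_D v}} \frac{P_v(v \mid \text{DEPS}_v \setminus \theta,\, \theta_{\text{new}})}{P_v(v \mid \text{DEPS}_v \setminus \theta,\, \theta_{\text{old}})}\right) + 1\right)\right] \tag{18}$$

where the second line used the inequality $x \ge \log x + 1$, and the sign is reversed since $c$ is negative.
Summing over $c \in \mathcal{C}$ and rearranging we get

$$E_{\mathcal{S} \,\mid\, \theta_{\text{new}}}\left[\sum_{c \in \mathcal{C}} \hat{c}\right] \le E_{\mathcal{S} \,\mid\, \theta_{\text{old}}}\left[\sum_{c \in \mathcal{C}} \hat{c} + \sum_{v \in \mathcal{S}} \log\left(\frac{p(v \mid \text{DEPS}_v \setminus \theta,\, \theta_{\text{new}})}{p(v \mid \text{DEPS}_v \setminus \theta,\, \theta_{\text{old}})}\right)\hat{Q}_v\right] \tag{19}$$

$$= E_{\mathcal{S} \,\mid\, \theta_{\text{old}}}\left[\sum_{v \in \mathcal{S}} \log p(v \mid \text{DEPS}_v \setminus \theta,\, \theta_{\text{new}})\,\hat{Q}_v\right] + \text{const}. \tag{20}$$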


Equation (20) allows for majorization-minimization algorithms (like the EM algorithm) to be used
to optimize with respect to $\theta$. In fact, similar equations have been derived by interpreting rewards
(negative costs) as probabilities, and then taking the variational lower bound on log-probability (e.g.,
[24]).
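The bound can be verified by direct enumeration on the smallest possible graph. The sketch below assumes a single Bernoulli node whose success probability is the parameter itself and a negative cost on each outcome; the function names and numbers are hypothetical. It evaluates the true expected cost under the new parameter and the right-hand side of the inequality in Equation (18) under the old one.

```python
import math

def expected_cost(p, cost):
    """E[c(x)] for x ~ Bernoulli(p); cost maps outcome 0 or 1 to a negative number."""
    return p * cost[1] + (1.0 - p) * cost[0]

def surrogate_bound(p_new, p_old, cost):
    """E_old[c(x) * (log(p_new(x) / p_old(x)) + 1)], the right-hand side of the bound."""
    total = 0.0
    for x, prob_old, prob_new in ((1, p_old, p_new), (0, 1.0 - p_old, 1.0 - p_new)):
        total += prob_old * cost[x] * (math.log(prob_new / prob_old) + 1.0)
    return total

cost = {0: -1.0, 1: -3.0}          # all costs negative, as the bound requires
p_old = 0.4
candidates = (0.1, 0.25, 0.4, 0.6, 0.9)
# gap = bound - true objective; it should be nonnegative and vanish at p_new = p_old.
gaps = [surrogate_bound(p, p_old, cost) - expected_cost(p, cost) for p in candidates]
```

The gap is zero exactly where the new and old parameters coincide, which is the property a majorization-minimization scheme relies on.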


C Examples 

This section considers two settings where the formalism of stochastic computation graphs can be 
applied. First, we consider the generalized EM algorithm for maximum likelihood estimation in 
probabilistic models with latent variables. Second, we consider reinforcement learning in Markov 
Decision Processes. In both cases, the objective function is given by an expectation; writing it out 
as a composition of stochastic and deterministic steps yields a stochastic computation graph. 

C.1 Generalized EM Algorithm and Variational Inference

The generalized EM algorithm maximizes likelihood in a probabilistic model with latent variables 
[18]. We start with a parameterized probability density p(x, z: 9) where x is observed, 2 is a latent 
variable, and 9 is a parameter of the distribution. The generalized EM algorithm maximizes the 
variational lower bound, which is defined by an expectation over z for each sample x : 

L(9,q) = E z ^ q 

As parameters will appear both in the probability density and inside the expectation, stochastic 
computation graphs provide a convenient route for deriving the gradient estimators. 

Neural variational inference. [14] propose a general¬ 
ized EM algorithm for multi-layered latent variable mod¬ 
els that employs an inference network , an explicit param¬ 
eterization of the posterior q ( /,(z \ x) ~ p(z \ x), to allow 
for fast approximate inference. The generative model and 
inference network take the form 



log 


p{x,z]9) 

9{z) 


( 21 ) 
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Pe{x) = ^2 P8i{ x \hi)pe 2 (hi\h2)pe 3 (h2\h 3 )pe 3 (h 3 ) 

hi ,h 2 

q<l>(hl,h2\x) = 90! (/il |cc )^ 2 (/l 2 (*3|*2)- 

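The inference model $q_\phi$ is used for sampling, i.e., we sample $h_1 \sim q_{\phi_1}(\cdot \mid x)$, 
$h_2 \sim q_{\phi_2}(\cdot \mid h_1)$, $h_3 \sim q_{\phi_3}(\cdot \mid h_2)$. The stochastic computation graph is shown above. 

$$L(\theta, \phi) = \mathbb{E}_{h \sim q_\phi}\Biggl[\underbrace{\log \frac{p_{\theta_1}(x \mid h_1)}{q_{\phi_1}(h_1 \mid x)}}_{=r_1} + \underbrace{\log \frac{p_{\theta_2}(h_1 \mid h_2)}{q_{\phi_2}(h_2 \mid h_1)}}_{=r_2} + \underbrace{\log \frac{p_{\theta_3}(h_2 \mid h_3)\, p_{\theta_3}(h_3)}{q_{\phi_3}(h_3 \mid h_2)}}_{=r_3}\Biggr].$$

Given a sample $h \sim q_\phi$, an unbiased estimate of the gradient is given by Theorem 2 as 

$$\frac{\partial L}{\partial \theta} \approx \frac{\partial}{\partial \theta} \log p_{\theta_1}(x \mid h_1) + \frac{\partial}{\partial \theta} \log p_{\theta_2}(h_1 \mid h_2) + \frac{\partial}{\partial \theta} \log \bigl(p_{\theta_3}(h_2 \mid h_3)\, p_{\theta_3}(h_3)\bigr) \tag{22}$$

$$\frac{\partial L}{\partial \phi} \approx \frac{\partial}{\partial \phi} \log q_{\phi_1}(h_1 \mid x)\,\bigl(Q_1 - b_1(x)\bigr) + \frac{\partial}{\partial \phi} \log q_{\phi_2}(h_2 \mid h_1)\,\bigl(Q_2 - b_2(h_1)\bigr) + \frac{\partial}{\partial \phi} \log q_{\phi_3}(h_3 \mid h_2)\,\bigl(Q_3 - b_3(h_2)\bigr) \tag{23}$$

where $Q_1 = r_1 + r_2 + r_3$, $Q_2 = r_2 + r_3$, and $Q_3 = r_3$, and $b_1, b_2, b_3$ are baseline functions. 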
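
The score-function form of the $\phi$-gradient can be checked numerically on a model small enough to enumerate. The following is a minimal sketch, not the authors' implementation: it assumes a single binary latent $h$ (so $Q = r$), a Bernoulli inference distribution with logit `phi`, and a made-up table `LOG_P_JOINT` for $\log p(x, h)$; all function names are hypothetical.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical toy model (not from the paper): one binary latent h and a
# fixed observation x, with log p(x, h) given by a lookup table.
LOG_P_JOINT = {0: math.log(0.1), 1: math.log(0.4)}

def log_q(h, phi):
    """log q_phi(h | x) for a Bernoulli with P(h = 1) = sigmoid(phi)."""
    p1 = sigmoid(phi)
    return math.log(p1 if h == 1 else 1.0 - p1)

def score(h, phi):
    """d/dphi log q_phi(h | x), which equals h - sigmoid(phi)."""
    return h - sigmoid(phi)

def reward(h, phi):
    """r = log p(x, h) - log q_phi(h | x); with one layer, Q = r."""
    return LOG_P_JOINT[h] - log_q(h, phi)

def exact_gradient(phi):
    """dL/dphi by enumerating h; only feasible because the model is tiny.

    The direct dependence of r on phi contributes E[-d/dphi log q] = 0,
    so only the score-function term remains.
    """
    return sum(math.exp(log_q(h, phi)) * score(h, phi) * reward(h, phi)
               for h in (0, 1))

def estimate_gradient(phi, n_samples, baseline, rng):
    """Monte Carlo mean and sample variance of score * (Q - baseline)."""
    terms = []
    for _ in range(n_samples):
        h = 1 if rng.random() < sigmoid(phi) else 0
        terms.append(score(h, phi) * (reward(h, phi) - baseline))
    mean = sum(terms) / n_samples
    var = sum((t - mean) ** 2 for t in terms) / (n_samples - 1)
    return mean, var

if __name__ == "__main__":
    phi = 0.3
    print("exact:        ", exact_gradient(phi))
    print("no baseline:  ", estimate_gradient(phi, 20000, 0.0, random.Random(0)))
    print("baseline -0.8:", estimate_gradient(phi, 20000, -0.8, random.Random(1)))
```

In this toy setting both estimates should agree with the enumerated gradient up to sampling noise, while the constant baseline (chosen here near the mean reward) should give a noticeably smaller per-sample variance.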


Variational Autoencoder, Deep Latent Gaussian Models and Reparameterization. Here we note that 
in some cases, the stochastic computation graph can be transformed to give the same probability 
distribution for the observed variables, but one obtains a different gradient estimator. Kingma and 
Welling [10] and Rezende et al. [21] consider a model that is similar to the one proposed by 
Mnih et al. [14] but with continuous latent variables, and they re-parameterize their inference 
network to enable the use of the PD estimator. The original objective, the variational lower bound, is 


$$L_{\text{orig}}(\theta, \phi) = \mathbb{E}_{h \sim q_\phi}\left[\log \frac{p_\theta(x \mid h)\, p_\theta(h)}{q_\phi(h \mid x)}\right]. \tag{24}$$


The second term, the entropy of $q_\phi$, can be computed analytically for the parametric forms of $q$ 
considered in the paper (Gaussians). For $q_\phi$ being conditionally Gaussian, i.e. 
$q_\phi(h \mid x) = N(h \mid \mu_\phi(x), \sigma_\phi(x))$, re-parameterizing leads to 
$h = h_\phi(\epsilon; x) = \mu_\phi(x) + \epsilon\, \sigma_\phi(x)$, giving 

$$L_{\text{re}}(\theta, \phi) = \mathbb{E}_{\epsilon \sim \rho}\bigl[\log p_\theta(x \mid h_\phi(\epsilon, x)) + \log p_\theta(h_\phi(\epsilon, x))\bigr] + H[q_\phi(\cdot \mid x)]. \tag{25}$$



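The stochastic computation graph before and after reparameterization is shown above. Given $\epsilon \sim \rho$, 
an estimate of the gradient is obtained as 

$$\frac{\partial L_{\text{re}}}{\partial \theta} \approx \frac{\partial}{\partial \theta}\bigl[\log p_\theta(x \mid h_\phi(\epsilon, x)) + \log p_\theta(h_\phi(\epsilon, x))\bigr], \tag{26}$$

$$\frac{\partial L_{\text{re}}}{\partial \phi} \approx \left[\frac{\partial}{\partial h} \log p_\theta(x \mid h_\phi(\epsilon, x)) + \frac{\partial}{\partial h} \log p_\theta(h_\phi(\epsilon, x))\right] \frac{\partial h}{\partial \phi} + \frac{\partial}{\partial \phi} H[q_\phi(\cdot \mid x)]. \tag{27}$$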
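
To make the pathwise $\phi$-gradient concrete, here is a small sketch under stated assumptions: a scalar latent with a standard normal prior and a unit-variance Gaussian likelihood (both invented for illustration, not taken from the paper), with the inference parameters being `mu` and `sigma` directly. The entropy gradient is added analytically, and a closed-form gradient for this toy model serves as a check.

```python
import math
import random

# Hypothetical toy model (not from the paper), log-densities up to constants:
#   log p(h)     = -h^2 / 2          (standard normal prior)
#   log p(x | h) = -(x - h)^2 / 2    (unit-variance Gaussian likelihood)
def dlogp_dh(x, h):
    """d/dh [log p(x | h) + log p(h)] = (x - h) - h."""
    return x - 2.0 * h

def reparam_gradient(x, mu, sigma, n_samples, rng):
    """Estimate dL_re/dmu and dL_re/dsigma using h = mu + eps * sigma."""
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # eps ~ rho = N(0, 1)
        h = mu + eps * sigma        # deterministic function of eps
        dh = dlogp_dh(x, h)
        g_mu += dh * 1.0            # chain rule with dh/dmu = 1
        g_sigma += dh * eps         # chain rule with dh/dsigma = eps
    g_mu /= n_samples
    g_sigma /= n_samples
    # The entropy of N(mu, sigma^2) is 0.5 * log(2 * pi * e * sigma^2); its
    # gradient is 0 w.r.t. mu and 1 / sigma w.r.t. sigma, added analytically.
    return g_mu, g_sigma + 1.0 / sigma

def closed_form_gradient(x, mu, sigma):
    """Exact gradient of L_re for this toy model, used only as a check."""
    return x - 2.0 * mu, -2.0 * sigma + 1.0 / sigma

if __name__ == "__main__":
    rng = random.Random(0)
    print("estimate:", reparam_gradient(1.0, 0.2, 0.5, 50000, rng))
    print("exact:   ", closed_form_gradient(1.0, 0.2, 0.5))
```

Because the sample $h$ is a deterministic function of $\epsilon$ and the parameters, the derivative passes through $h$ by the ordinary chain rule, which is what distinguishes this estimator from the score-function form.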


C.2 Policy Gradients in Reinforcement Learning. 

In reinforcement learning, an agent interacts with an environment according to its policy $\pi$, and the 
goal is to maximize the expected sum of rewards, called the return. Policy gradient methods seek 
to directly estimate the gradient of expected return with respect to the policy parameters [26, 1, 23]. 
In reinforcement learning, we typically assume that the environment dynamics are not available 
analytically and can only be sampled. Below we distinguish two important cases: the Markov 
decision process (MDP) and the partially observable Markov decision process (POMDP). 


MDPs. In the MDP case, the expectation is taken with respect to the distribution over state ($s$) 
and action ($a$) sequences 

$$L(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=1}^{T} r(s_t, a_t)\right], \tag{28}$$


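where $\tau = (s_1, a_1, s_2, a_2, \dots)$ are trajectories and the distribution over trajectories is 
defined in terms of the environment dynamics $p_E(s_{t+1} \mid s_t, a_t)$ and the policy $\pi_\theta$: 
$p_\theta(\tau) = p_E(s_1) \prod_t \pi_\theta(a_t \mid s_t)\, p_E(s_{t+1} \mid s_t, a_t)$. Here $r$ denotes rewards 
(negative costs in the terminology of the rest of the paper). The classic REINFORCE [26] estimate 
of the gradient is given by 

$$\frac{\partial}{\partial \theta} L = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=1}^{T} \frac{\partial}{\partial \theta} \log \pi_\theta(a_t \mid s_t) \left(\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) - b_t(s_t)\right)\right], \tag{29}$$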


where $b_t(s_t)$ is an arbitrary baseline which is often chosen to approximate 
$V_t(s_t) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \mid s_t\right]$, i.e. the state-value function. Note that the 
stochastic action nodes $a_t$ “block” the differentiable path from $\theta$ to rewards, which eliminates 
the need to differentiate through the unknown environment dynamics. 
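
As an illustration of the REINFORCE estimator with a reward-to-go term and a state-dependent baseline, the sketch below uses a made-up two-state, two-action MDP with horizon `T = 3`. The reward table `REWARD`, the transition rule `P_NEXT_IS_A`, and the per-state logistic policy are all assumptions for the example, not part of the paper. The estimator only samples the dynamics; the exact expected return is computed by enumeration solely to check the result with finite differences.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical toy MDP (not from the paper): 2 states, 2 actions, horizon T.
T = 3
REWARD = [[0.0, 0.3], [0.5, 1.0]]   # REWARD[s][a]
P_NEXT_IS_A = 0.8                   # next state equals the action w.p. 0.8

def pi(a, s, theta):
    """Policy pi_theta(a | s): Bernoulli with P(a = 1 | s) = sigmoid(theta[s])."""
    p1 = sigmoid(theta[s])
    return p1 if a == 1 else 1.0 - p1

def expected_return(theta, t=1, s=0):
    """Exact expected return by enumeration; used only to check the estimator."""
    if t > T:
        return 0.0
    total = 0.0
    for a in (0, 1):
        future = 0.0
        for s_next in (0, 1):
            p = P_NEXT_IS_A if s_next == a else 1.0 - P_NEXT_IS_A
            future += p * expected_return(theta, t + 1, s_next)
        total += pi(a, s, theta) * (REWARD[s][a] + future)
    return total

def reinforce_gradient(theta, n_traj, baseline, rng):
    """Average over trajectories of sum_t score_t * (reward_to_go_t - b_t(s_t))."""
    grad = [0.0, 0.0]
    for _ in range(n_traj):
        s = 0
        steps = []
        for t in range(1, T + 1):
            a = 1 if rng.random() < sigmoid(theta[s]) else 0
            steps.append((t, s, a, REWARD[s][a]))
            # The environment is only sampled, never differentiated.
            s = a if rng.random() < P_NEXT_IS_A else 1 - a
        reward_to_go = 0.0
        for t, s_t, a_t, r_t in reversed(steps):
            reward_to_go += r_t
            score = a_t - sigmoid(theta[s_t])  # d/dtheta[s_t] log pi(a_t | s_t)
            grad[s_t] += score * (reward_to_go - baseline(t, s_t))
    return [g / n_traj for g in grad]

def finite_difference_gradient(theta, eps=1e-5):
    """Central differences of the exact expected return."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        down = list(theta)
        down[i] -= eps
        grad.append((expected_return(up) - expected_return(down)) / (2 * eps))
    return grad

if __name__ == "__main__":
    theta = [0.2, -0.4]
    rough_value = lambda t, s: 0.5 * (T - t + 1)  # crude guess at V_t(s_t)
    print(reinforce_gradient(theta, 20000, rough_value, random.Random(0)))
    print(finite_difference_gradient(theta))
```

Any baseline that depends only on $t$ and $s_t$ leaves the estimate unbiased; the crude linear guess used here is only meant to show where an approximation of $V_t(s_t)$ would plug in.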


POMDPs. POMDPs differ from MDPs in that the state $s_t$ of the environment is not observed directly 
but, as in latent-variable time series models, only through stochastic observations $o_t$, which 
depend on the latent states $s_t$ via $p_E(o_t \mid s_t)$. The policy therefore has to be a function of the 
history of past observations, $\pi_\theta(a_t \mid o_1 \dots o_t)$. Applying Theorem 2, we obtain a gradient 
estimator: 

$$\frac{\partial}{\partial \theta} L = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=1}^{T} \frac{\partial}{\partial \theta} \log \pi_\theta(a_t \mid o_1 \dots o_t) \left(\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) - b_t(o_1 \dots o_t)\right)\right]. \tag{30}$$


Here, the baseline $b_t$ and the policy $\pi_\theta$ can depend on the observation history through time $t$, 
and these functions can be parameterized as recurrent neural networks [25, 15]. 
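
One way to see why a history-dependent baseline does not bias the estimator is the following short derivation, written for a discrete action space (for continuous actions the sum would become an integral). It is a sketch of the standard argument rather than a step taken from the paper. Since $b_t(o_1 \dots o_t)$ is fixed once the observation history is given, it factors out of the inner expectation over $a_t$:

```latex
\begin{aligned}
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid o_1 \dots o_t)}
  \left[ \frac{\partial}{\partial \theta} \log \pi_\theta(a_t \mid o_1 \dots o_t) \right]
  b_t(o_1 \dots o_t)
&= b_t(o_1 \dots o_t) \sum_{a_t} \pi_\theta(a_t \mid o_1 \dots o_t)\,
   \frac{\partial}{\partial \theta} \log \pi_\theta(a_t \mid o_1 \dots o_t) \\
&= b_t(o_1 \dots o_t)\, \frac{\partial}{\partial \theta} \sum_{a_t} \pi_\theta(a_t \mid o_1 \dots o_t) \\
&= b_t(o_1 \dots o_t)\, \frac{\partial}{\partial \theta} 1 = 0 .
\end{aligned}
```

The baseline therefore changes only the variance of the estimate, which is why it can be chosen freely, for example as a learned function of the observation history.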